[SPARK-8300] DataFrame hint for broadcast join.#6751
[SPARK-8300] DataFrame hint for broadcast join.#6751rxin wants to merge 3 commits intoapache:masterfrom
Conversation
|
@rxin - This is great. I'm a fan of the alternative syntax we chatted about: |
|
Test build #34641 has finished for PR 6751 at commit
|
There was a problem hiding this comment.
Could we just have this be StatisticsHint and override the statistics? I'm afraid that we are going to forget to add cases in the future as we broadcast more type of joins.
There was a problem hiding this comment.
I'm thinking about this more and actually maybe what we want to do is have this specific node, and a pattern that recognizes canBroadcast. This pattern can check either for a small enough size or this hint and we can use that anywhere we are planning a broadcast operator. The reasoning is that messing with statistics could have other consequences.
There was a problem hiding this comment.
+1
Add specific node is also doable, probably the node can be named like Hint in LogicalPlan, and could be more than broadcast, like uniq_key etc.
Any idea?
|
Regarding |
|
We should add isNotNull too. The isNotNull was from SchemaRDD. |
|
I'm confused now: both |
|
I meant adding isNotNull as a function, rather than a member method of Column. Basically most of those Column.function were inherited from SchemaRDD time. |
|
Test build #35520 has finished for PR 6751 at commit
|
|
Merging to master! |
|
How can this hint be used with Spark-SQL ? Below example does not seem to work.
|
|
It can only be used from the dataframe API.
|
|
Using broadcast(...) results in "java.lang.AssertionError: assertion failed: No plan for BroadcastHint" as in https://issues.apache.org/jira/browse/SPARK-12275 so this solution is not working most of the time |
|
Isn't the hint available in SQL? |
|
Its available from spark 2.2.0.
On Tuesday, October 10, 2017, 1:46:27 PM PDT, Reynold Xin <notifications@github.com> wrote:
Isn't the hint available in SQL?
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub, or mute the thread.
|
|
Is there an example?
|
|
Let me know if this does not help
[SPARK-16475] Broadcast Hint for SQL Queries - ASF JIRA
|
|
| |
[SPARK-16475] Broadcast Hint for SQL Queries - ASF JIRA
|
|
|
On Tuesday, October 10, 2017, 6:25:57 PM PDT, fjh100456 <notifications@github.com> wrote:
Is there an example?
I use broadcast like the following, but it perform an error.Would you be so kind as to show me an example?
spark-sql> select a.* from tableA a left outer join broadcast(tableB) b on a.a=b.a;
17/10/11 09:06:40 INFO HiveMetaStore: 0: get_table : db=default tbl=tablea
17/10/11 09:06:40 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=tablea
Error in query: cannot resolve 'tableB' given input columns: []; line 1 pos 51;
'Project [ArrayBuffer(a).*]
+- 'Join LeftOuter, ('a.a = 'b.a)
:- SubqueryAlias a
: +- SubqueryAlias tablea
: +- HiveTableRelation default.tablea, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#94, b#95, c#96]
+- 'SubqueryAlias b
+- 'UnresolvedTableValuedFunction broadcast, ['tableB]
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub, or mute the thread.
|
|
With |
Users can now do
left.join(broadcast(right), "joinKey")to give the query planner a hint that "right" DataFrame is small and should be broadcasted.